Analysis of Some Methods for Reduced Rank Gaussian Process Regression

Authors

  • Joaquin Quiñonero Candela
  • Carl E. Rasmussen
Abstract

While there is strong motivation for using Gaussian Processes (GPs) due to their excellent performance in regression and classification problems, their computational complexity makes them impractical when the size of the training set exceeds a few thousand cases. This has motivated the recent proliferation of a number of cost-effective approximations to GPs, both for classification and for regression. In this paper we analyze one popular approximation to GPs for regression: the reduced rank approximation. While generally GPs are equivalent to infinite linear models, we show that Reduced Rank Gaussian Processes (RRGPs) are equivalent to finite sparse linear models. We also introduce the concept of degenerate GPs and show that they correspond to inappropriate priors. We show how to modify the RRGP to prevent it from being degenerate at test time. Training RRGPs consists of learning both the covariance function hyperparameters and the support set. We propose a method for learning hyperparameters for a given support set. We also review the Sparse Greedy GP (SGGP) approximation (Smola and Bartlett, 2001), which is a way of learning the support set for given hyperparameters based on approximating the posterior. We propose an alternative method to the SGGP that has better generalization capabilities. Finally, we conduct experiments comparing the different ways of training an RRGP. We provide some Matlab code for learning RRGPs.

1 Motivation and Organization of the Paper

Gaussian Processes (GPs) have state of the art performance in regression and classification problems, but they suffer from high computational cost for learning and predictions. For a training set containing n cases, the complexity of training is O(n^3), and that of making a prediction is O(n) for computing the predictive mean and O(n^2) for computing the predictive variance.

A few computationally effective approximations to GPs have recently been proposed. These include the sparse iterative schemes of Csató (2002), Csató and Opper (2002), Seeger (2003), and Lawrence et al. (2003), all based on minimizing KL divergences between the approximating and the true posterior; the schemes of Smola and Schölkopf (2000) and Smola and Bartlett (2001), based on a low rank approximate posterior; those of Gibbs and MacKay (1997) and Williams and Seeger (2001), based on matrix approximations; and that of Tresp (2000), based on neglecting correlations. Subsets of regressors (Wahba et al., 1999) and the Relevance Vector Machine (Tipping, 2001) can also be cast as sparse linear approximations to GPs. Schwaighofer and Tresp (2003) provide a very interesting yet brief comparison of some of these approximations to GPs. They only address the quality of the approximations in terms of the predictive mean, ignoring the predictive uncertainties and leaving some theoretical questions unanswered, such as the goodness of approximating the maximum of the posterior.

In this paper we analyze sparse linear, or equivalently reduced rank, approximations to GPs that we will call Reduced Rank Gaussian Processes (RRGPs). We introduce the concept of degenerate Gaussian Processes and explain that they correspond to inappropriate priors over functions (for example, the predictive variance shrinks as the test points move far from the training set).
We show that if not used with care at prediction time, RRGP approximations result in degenerate GPs. We give a solution to this problem, which consists of augmenting the finite linear model at test time. This guarantees that the RRGP approach corresponds to an appropriate prior. Our analysis of RRGPs should be of interest in general for better understanding the infinite nature of Gaussian Processes and the limitations of diverse approximations (in particular of those based solely on the posterior distribution).

Learning RRGPs involves both selecting a support set and learning the hyperparameters of the covariance function. Doing both simultaneously proves to be difficult in practice and questionable theoretically. Smola and Bartlett (2001) proposed the Sparse Greedy Gaussian Process (SGGP), a method for learning the support set for given hyperparameters of the covariance function based on approximating the posterior. We show that approximating the posterior is unsatisfactory, since it fails to guarantee generalization, and propose a theoretically more sound greedy algorithm for support set selection based on maximizing the marginal likelihood. We show that the SGGP relates to our method in that approximating the posterior reduces to partially maximizing the marginal likelihood. We illustrate our analysis with an example. We propose an approach for learning the hyperparameters of the covariance function of RRGPs for a given support set, originally introduced by Rasmussen (2002). We also provide Matlab code in Appendix B for this method. We conduct experiments in which we compare learning based on selecting the support set to learning based on inferring the hyperparameters. We pay special attention to evaluating the quality of the different approximations to computing predictive variances.

The paper is organized as follows. We give a brief introduction to GPs in Sect. 2. In Sect. 3 we establish the equivalence between GPs and linear models, showing that in the general case GPs are equivalent to infinite linear models. We also present degenerate GPs. In Sect. 4 we introduce RRGPs and address the issue of training them. In Sect. 5 we present the experiments we conducted. We give some discussion in Sect. 6.

2 Introduction to Gaussian Processes

In inference with parametric models, prior distributions are often imposed over the model parameters, which can be seen as a means of imposing regularity and improving generalization. The form of the parametric model, together with the form of the prior distribution on the parameters, results in an (often implicit) prior assumption on the joint distribution of the function values. At prediction time the quality of the predictive uncertainty will depend on the prior over functions. Unfortunately, for probabilistic parametric models this prior is defined in an indirect way, and in many cases this results in priors with undesired properties. An example of a model with a peculiar prior over functions is the Relevance Vector Machine introduced by Tipping (2001), for which the predictive variance shrinks for a query point far away from the training inputs. If this property of the predictive variance is undesired, then one concludes that the prior over functions was undesirable in the first place, and one would have been happy to be able to directly define a prior over functions. Gaussian Processes (GPs) are non-parametric models where a Gaussian process prior is directly defined over function values.
The direct use of Gaussian Processes as priors over functions was motivated by Neal (1996) while he was studying priors over weights for artificial neural networks. A model equivalent to GPs, kriging, has long been used for the analysis of spatial data in Geostatistics (Cressie, 1993). More formally, in a GP the function outputs f(x_i) are a collection of random variables indexed by the inputs x_i. Any finite subset of outputs has a joint multivariate Gaussian distribution (for an introduction to GPs, and a thorough comparison with Neural Networks, see Rasmussen, 1996). Given a set of training inputs {x_i | i = 1, . . . , n} ⊂ R^D (organized as rows in the matrix X), the joint prior distribution of the corresponding function outputs f = [f(x_1), . . . , f(x_n)] is Gaussian, p(f|X, θ) ∼ N(0, K), with zero mean (this is a common and arbitrary choice) and covariance matrix K_ij = K(x_i, x_j). The GP is entirely determined by the covariance function K(x_i, x_j) with parameters θ. An example of a covariance function that is very commonly used is the squared exponential:

K(x_i, x_j) = \theta_{D+1} \exp\left( -\frac{1}{2} \sum_{d=1}^{D} \frac{(X_{id} - X_{jd})^2}{\theta_d^2} \right)
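To make the definitions above concrete, here is a short Matlab sketch (Matlab being the language the authors use for the code they provide, although the snippet below is not their Appendix B code; the input grid, the hyperparameter values and the jitter term are illustrative assumptions made here). It builds the squared exponential covariance matrix K for a set of inputs and draws one function from the zero-mean GP prior p(f|X, θ) ∼ N(0, K); the Cholesky factorisation used for the draw is the same O(n^3) operation on an n x n matrix that dominates the cost of full GP training mentioned in Sect. 1.

    % Illustrative sketch, not the authors' Appendix B code.
    % Squared exponential covariance matrix and one draw from the GP prior.
    D = 1; n = 100;
    X = linspace(-5, 5, n)';            % n x D matrix of inputs (assumed grid)
    theta = [ones(D,1); 1];             % [lengthscales theta_1..theta_D; amplitude theta_{D+1}] (assumed values)

    K = zeros(n);                       % covariance matrix, K(i,j) = K(x_i, x_j)
    for i = 1:n
      for j = 1:n
        r2 = sum((X(i,:) - X(j,:)).^2 ./ (theta(1:D)'.^2));
        K(i,j) = theta(D+1) * exp(-0.5 * r2);
      end
    end

    % Draw f ~ N(0, K) via a Cholesky factor; this O(n^3) factorisation is the
    % operation that dominates the cost of full GP training.
    L = chol(K + 1e-8*eye(n), 'lower'); % small jitter added for numerical stability
    f = L * randn(n, 1);

The double loop mirrors the definition K_ij = K(x_i, x_j) directly; in practice one would vectorize it, but it is the cubic cost of the factorisation, not the quadratic cost of filling K, that limits full GPs to a few thousand training cases.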


Related articles

Transductive and Inductive Methods for Approximate Gaussian Process Regression

Gaussian process regression allows a simple analytical treatment of exact Bayesian inference and has been found to provide good performance, yet scales badly with the number of training data. In this paper we compare experimentally three of the leading approaches towards scaling Gaussian process regression to large data sets: the subset of representers method, the reduced rank approximation, ...


Hilbert Space Methods for Reduced-Rank Gaussian Process Regression

This paper proposes a novel scheme for reduced-rank Gaussian process regression. The method is based on an approximate series expansion of the covariance function in terms of an eigenfunction expansion of the Laplace operator in a compact subset of R^d. On this approximate eigenbasis the eigenvalues of the covariance function can be expressed as simple functions of the spectral density of the Gau...


Gaussian Processes for Regression and Optimisation

Gaussian processes have proved to be useful and powerful constructs for the purposes of regression. The classical method proceeds by parameterising a covariance function, and then infers the parameters given the training data. In this thesis, the classical approach is augmented by interpreting Gaussian processes as the outputs of linear filters excited by white noise. This enables a straightfor...


Multivariate reduced rank regression in non-Gaussian contexts, using copulas

We propose a new procedure to perform Reduced Rank Regression (RRR) in non-Gaussian contexts, based on Multivariate Dispersion Models. Reduced-Rank Multivariate Dispersion Models (RR-MDM) generalise RRR to a very large class of distributions, which includes continuous distributions like the normal, Gamma, Inverse Gaussian, and discrete distributions like the Poisson and the binomial. A multivaria...


Tensor Regression Meets Gaussian Processes

Low-rank tensor regression, a new model class that learns high-order correlation from data, has recently received considerable attention. At the same time, Gaussian processes (GP) are well-studied machine learning models for structure learning. In this paper, we demonstrate interesting connections between the two, especially for multi-way data analysis. We show that low-rank tensor regression i...


A robust autoregressive gaussian process motion model using l1-norm based low-rank kernel matrix approximation

This paper considers the problem of modeling complex motions of pedestrians in a crowded environment. A number of methods have been proposed to predict the motion of a pedestrian or an object. However, it is still difficult to make a good prediction due to challenges, such as the complexity of pedestrian motions and outliers in a training set. This paper addresses these issues by proposing a ro...




Publication date: 2003